Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server
In the last decade, data analytics has rapidly progressed from traditional
disk-based processing to modern in-memory processing. However, little effort
has been devoted to enhancing performance at the micro-architecture level. This
paper characterizes the performance of in-memory data analytics using the Apache
Spark framework. We use a single-node NUMA machine and identify the bottlenecks
hampering the scalability of workloads. We also quantify the inefficiencies at
the micro-architecture level for various data analysis workloads. Through empirical
evaluation, we show that Spark workloads do not scale linearly beyond twelve
threads, due to work time inflation and thread-level load imbalance. Further,
at the micro-architecture level, we observe memory-bound latency to be the
major cause of work time inflation.
Comment: Accepted to the 5th IEEE International Conference on Big Data and
Cloud Computing (BDCloud 2015)
Code Generation and Run-time Support For Multi-Level Parallelism . . .
In this paper we describe the main components of the NanosCompiler, an OpenMP compiler whose implementation is oriented towards the efficient exploitation of nested parallelism. Program parallelization relies both on the automatic parallelization capabilities of the base compiler and on the information obtained from user-supplied directives. The compiler uses a hierarchical internal representation that unifies both sources of parallelism, proceeds with a task identification phase that adapts the granularity of the final tasks to the target architecture, and then generates parallel code. The paper also presents an analysis of the special support needed at the threads library level for this kind of parallelism. These requirements are analyzed in our current implementation, named NthLib.
Towards an efficient exploitation of loop-level parallelism in Java
This paper analyzes the overheads incurred in the exploitation of loop-level parallelism using Java Threads and proposes some code transformations that minimize them. Avoiding the intensive use of Java Threads and reducing the number of classes used to specify the parallelism in the application result in promising performance gains that may encourage the use of Java for exploiting loop-level parallelism. On average, the execution time of our synthetic benchmarks is reduced by 50% with the simplest transformation when 8 threads are used.
Task-based Parallel Breadth-First Search in Heterogeneous Environments
Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem that is hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates of up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows that high processing rates are achievable in hybrid environments despite GPU communication latency and memory coherence overheads.
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications
In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications.